Customer Segmentation with PCA and Clustering¶
Project Overview¶
This project focuses on the implementation of Principal Component Analysis (PCA) and clustering techniques to gain valuable insights into a grocery firm's customer database. The ultimate goal is to perform customer segmentation, a crucial practice that involves categorizing customers into distinct groups based on shared characteristics. By doing so, we aim to enhance our understanding of customer behavior and tailor our products and services to meet their diverse needs and preferences.
Dataset Overview¶
Introduction:¶
The dataset at hand represents a snapshot of a grocery store's customer data. The company is keen on segmenting its customers to better understand their behavior and tailor its products and services to different customer segments.
This project outline provides a structured and comprehensive overview of the goals, steps, and expected outcomes of the customer segmentation project using PCA and clustering techniques.
Link to dataset:¶
https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis
Dataset Description¶
There are four groups of information:
**People**

| Field | Description |
|---|---|
| ID | Customer's unique identifier |
| Year_Birth | Customer's birth year |
| Education | Customer's education level |
| Marital_Status | Customer's marital status |
| Income | Customer's yearly household income |
| Kidhome | Number of children in customer's household |
| Teenhome | Number of teenagers in customer's household |
| Dt_Customer | Date of customer's enrollment with the company |
| Recency | Number of days since customer's last purchase |
| Complain | 1 if the customer complained in the last 2 years, 0 otherwise |

**Products**

| Field | Description |
|---|---|
| MntWines | Amount spent on wine in the last 2 years |
| MntFruits | Amount spent on fruits in the last 2 years |
| MntMeatProducts | Amount spent on meat in the last 2 years |
| MntFishProducts | Amount spent on fish in the last 2 years |
| MntSweetProducts | Amount spent on sweets in the last 2 years |
| MntGoldProds | Amount spent on gold in the last 2 years |

**Promotion**

| Field | Description |
|---|---|
| NumDealsPurchases | Number of purchases made with a discount |
| AcceptedCmp1 | 1 if the customer accepted the offer in the 1st campaign, 0 otherwise |
| AcceptedCmp2 | 1 if the customer accepted the offer in the 2nd campaign, 0 otherwise |
| AcceptedCmp3 | 1 if the customer accepted the offer in the 3rd campaign, 0 otherwise |
| AcceptedCmp4 | 1 if the customer accepted the offer in the 4th campaign, 0 otherwise |
| AcceptedCmp5 | 1 if the customer accepted the offer in the 5th campaign, 0 otherwise |
| Response | 1 if the customer accepted the offer in the last campaign, 0 otherwise |

**Place**

| Field | Description |
|---|---|
| NumWebPurchases | Number of purchases made through the company's website |
| NumCatalogPurchases | Number of purchases made using a catalogue |
| NumStorePurchases | Number of purchases made directly in stores |
| NumWebVisitsMonth | Number of visits to the company's website in the last month |
Key Steps in the Project¶
Data Exploration:
- Explore the structure and characteristics of the dataset to gain initial insights.
Data Preprocessing:
- Handle missing values, outliers, and any necessary data transformations.
Principal Component Analysis (PCA):
- Implement PCA to reduce the dimensionality of the dataset while retaining essential information.
Clustering:
- Utilize clustering algorithms (e.g., K-means) to segment customers based on their purchasing behavior.
Interpretation:
- Interpret the results of the clustering and visualize customer segments for better understanding.
Business Recommendations:
- Provide actionable recommendations based on the identified customer segments to enhance marketing and service strategies.
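The steps above can be sketched end-to-end in miniature. The snippet below is only an illustration on synthetic data; the array `X`, the component count, and the cluster count are all placeholders, not the notebook's actual choices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Two synthetic "customer groups" stand in for the real dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 6)) for c in (0, 4)])

X_scaled = StandardScaler().fit_transform(X)          # preprocessing
X_pca = PCA(n_components=2).fit_transform(X_scaled)   # dimensionality reduction
labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(X_pca)    # segmentation
print(X_pca.shape, len(set(labels)))                  # (100, 2) 2
```

The rest of the notebook carries out the same sequence on the grocery customer data, with the choices at each step motivated by the exploration below.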
Expected Outcomes¶
- A clear segmentation of customers into distinct groups.
- Insights into the key features driving customer segmentation.
- Practical recommendations for tailoring products and services to different customer segments.
Table of Contents¶
- 3.1. Encoding
- 3.2. Scale the Data
- **Final conclusion**
import pandas as pd
import numpy as np
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
sns.set_theme(style='white')
%config InlineBackend.figure_format = 'retina'
plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.top"] = False
#import warnings
#warnings.filterwarnings("ignore")
1.2. Get basic information about the dataset¶
original_df = pd.read_csv('marketing_campaign.csv', delimiter='\t')
print(original_df.shape)
original_df.head()
(2240, 29)
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.00 | 0 | 0 | 04-09-2012 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.00 | 1 | 1 | 08-03-2014 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.00 | 0 | 0 | 21-08-2013 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.00 | 1 | 0 | 10-02-2014 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.00 | 1 | 0 | 19-01-2014 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
df = original_df.copy()
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Z_CostContact 2240 non-null int64 27 Z_Revenue 2240 non-null int64 28 Response 2240 non-null int64 dtypes: float64(1), int64(25), object(3) memory usage: 507.6+ KB
Create Year_Enrolled as a datetime from Dt_Customer¶
df['Year_Enrolled'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')
missing_encountered = False
for col in df.columns:
missing = df[col].isnull().sum()
if missing > 0:
print(f'{col}: {missing} missing values.')
missing_encountered = True
if not missing_encountered:
print("No missing values.")
Income: 24 missing values.
The dataset comprises 29 columns: 25 integer variables, 3 objects, and 1 float. The Income variable is the only one with missing values.
df['Income'] = df['Income'].fillna(df['Income'].median())
1.3. Examine unique values of each column¶
data = []
for col in df.columns:
unique_values = df[col].unique()
num_unique = len(unique_values)
dtype = df[col].dtype
if num_unique > 10:
data.append((col, num_unique, "", dtype))
else:
unique_values_str = ", ".join(map(str, unique_values))
data.append((col, num_unique, unique_values_str, dtype))
uniques = pd.DataFrame(data, columns=["Column", "# of Uniques", "Unique Values (<10)", "Dtype"])
uniques.sort_values(by='# of Uniques').reset_index(drop=True)
| Column | # of Uniques | Unique Values (<10) | Dtype | |
|---|---|---|---|---|
| 0 | Z_Revenue | 1 | 11 | int64 |
| 1 | Z_CostContact | 1 | 3 | int64 |
| 2 | AcceptedCmp2 | 2 | 0, 1 | int64 |
| 3 | Response | 2 | 1, 0 | int64 |
| 4 | AcceptedCmp1 | 2 | 0, 1 | int64 |
| 5 | AcceptedCmp5 | 2 | 0, 1 | int64 |
| 6 | AcceptedCmp3 | 2 | 0, 1 | int64 |
| 7 | Complain | 2 | 0, 1 | int64 |
| 8 | AcceptedCmp4 | 2 | 0, 1 | int64 |
| 9 | Teenhome | 3 | 0, 1, 2 | int64 |
| 10 | Kidhome | 3 | 0, 1, 2 | int64 |
| 11 | Education | 5 | Graduation, PhD, Master, Basic, 2n Cycle | object |
| 12 | Marital_Status | 8 | Single, Together, Married, Divorced, Widow, Al... | object |
| 13 | NumCatalogPurchases | 14 |  | int64 |
| 14 | NumStorePurchases | 14 |  | int64 |
| 15 | NumDealsPurchases | 15 |  | int64 |
| 16 | NumWebPurchases | 15 |  | int64 |
| 17 | NumWebVisitsMonth | 16 |  | int64 |
| 18 | Year_Birth | 59 |  | int64 |
| 19 | Recency | 100 |  | int64 |
| 20 | MntFruits | 158 |  | int64 |
| 21 | MntSweetProducts | 177 |  | int64 |
| 22 | MntFishProducts | 182 |  | int64 |
| 23 | MntGoldProds | 213 |  | int64 |
| 24 | MntMeatProducts | 558 |  | int64 |
| 25 | Year_Enrolled | 663 |  | datetime64[ns] |
| 26 | Dt_Customer | 663 |  | object |
| 27 | MntWines | 776 |  | int64 |
| 28 | Income | 1975 |  | float64 |
| 29 | ID | 2240 |  | int64 |
1.4. Summary of the dataset¶
1 Unique Value:
- `Z_Revenue` and `Z_CostContact` each exhibit a single value (11 and 3 respectively), implying no contribution to our clustering analysis. Consequently, I will remove them from the dataset.

2 Unique Values:
- `AcceptedCmp1` through `AcceptedCmp5`, `Complain`, and `Response` take the binary values 0 and 1. I intend to visualize them to assess the potential for combining these variables.

3 Unique Values:
- `Teenhome` and `Kidhome` can be combined into a single variable named `Dependents`, the number of dependents in a family.

Other variables:
- `Education` entails five categories: 'Graduation', 'PhD', 'Master', 'Basic', and '2n Cycle'. After researching the significance of 'Basic' and '2n Cycle' education levels:
  - 'Basic' signifies foundational or primary education. This category will be retained as is.
  - '2n Cycle' represents advanced studies beyond a bachelor's degree (1st Cycle). It will be merged with the 'Master' category.
- `Marital_Status`: With eight categories (Single, Together, Married, Divorced, Widow, Alone, Absurd, YOLO), I propose merging them into two categories: 'Relationship' and 'Single'.
- `ID`: This variable is deemed non-contributory and will be excluded from further analysis.
- `Year_Birth`: To enhance interpretability, I will transform `Year_Birth` into `Age`, using the maximum enrollment date as a reference threshold.
- `Year_Enrolled`: Similarly, I will create a new variable named `Day_Enrolled` using the same methodology.
- Other categorical variables that encompass more than 10 options will remain unaltered.
categorical_cols = [
'Kidhome',
'Teenhome',
'AcceptedCmp1',
'AcceptedCmp2',
'AcceptedCmp3',
'AcceptedCmp4',
'AcceptedCmp5',
'Complain',
'Response',
'Education',
'Marital_Status'
]
fig, axes = plt.subplots(3, 4, figsize=(16, 12))
for i, col in enumerate(categorical_cols):
order = df[col].value_counts().index
ax = axes[i // 4, i % 4]
sns.countplot(data=df, x=col, ax=ax, order=order, palette='magma', hue=col, legend=False)
ax.set_title(col)
ax.set_xlabel('')
ax.set_ylabel('')
if col == 'Marital_Status':
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
2.1.1. Summary of the Countplot¶
The countplot analysis reveals significant imbalances in the distribution of values within categorical variables.

- `Kidhome` and `Teenhome`: combining them might be beneficial for a more consolidated representation of the number of children in different age groups.
- `AcceptedCmp1` to `AcceptedCmp5`: combining them is worth considering given their apparent similarity, potentially simplifying the analysis and interpretation.
- `Education` and `Marital_Status`: cleaning or grouping is needed within these variables to address potential inconsistencies and improve categorization.
- Complaints and survey response: most customers did not register any complaints, and a relatively small proportion of the total customer base (around 300 out of 2240) responded positively to the last campaign, as recorded in `Response`.

Below is a more detailed analysis of each variable:
2.1.2. Examine Complain¶
df['Complain'].value_counts()
Complain 0 2219 1 21 Name: count, dtype: int64
The highly imbalanced distribution of the Complain variable (21 occurrences of 1 and 2219 occurrences of 0) might not contribute meaningful insights to customer segmentation so I will drop this variable.
2.1.3. Examine AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4 and AcceptedCmp5¶
accepted = df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']]
plt.figure(figsize=(8, 6))
sns.heatmap(accepted, cbar=False)
plt.title('Accepted Campaigns');
2.1.4. Accepted_Campaign variable to sum up the fives¶
df['Accepted_Campaign'] = df[['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2']].sum(axis=1)
prop_accepted_campaign = df['Accepted_Campaign'].value_counts(normalize=True)
sns.countplot(x=df['Accepted_Campaign'], palette='magma', hue=df['Accepted_Campaign'], legend=False)
for i, prop in enumerate(prop_accepted_campaign):
plt.text(i, prop * len(df), f'{prop:.2%}', ha='center', va='bottom', color='black', fontsize=12)
plt.title('Distribution of Total Accepted Campaigns');
The distribution of accepted campaigns among customers is skewed, with most customers (approximately 79%) not accepting any campaigns. Around 15% of customers accepted one campaign, while smaller proportions engaged with two, three, or four campaigns, with the latter category having negligible representation.
2.1.5. Create Dependents variable based on Kidhome and Teenhome¶
df['Dependents'] = df['Kidhome'] + df['Teenhome']
2.1.6. Grouping Marital_Status¶
group_mapping = {
'Married': 'Relationship',
'Together': 'Relationship',
'Single': 'Single',
'Divorced': 'Single',
'Widow': 'Single',
'Alone': 'Single',
'Absurd': 'Single',
'YOLO': 'Single'
}
df['Marital_Status'] = df['Marital_Status'].map(group_mapping)
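One caveat worth checking after this step: `Series.map` returns NaN for any category missing from the dictionary, which would silently propagate into the encoding step. A small sketch of the check, on made-up statuses:

```python
import pandas as pd

# Same mapping as above; any status absent from the dict would become NaN.
group_mapping = {
    'Married': 'Relationship', 'Together': 'Relationship',
    'Single': 'Single', 'Divorced': 'Single', 'Widow': 'Single',
    'Alone': 'Single', 'Absurd': 'Single', 'YOLO': 'Single',
}
status = pd.Series(['Married', 'YOLO', 'Widow', 'Together'])
mapped = status.map(group_mapping)
assert not mapped.isnull().any(), "unmapped marital status found"
print(mapped.tolist())  # ['Relationship', 'Single', 'Single', 'Relationship']
```

Running `df['Marital_Status'].isnull().sum()` after the real mapping would confirm the same for the full dataset.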
2.1.7. Grouping Education into smaller groups¶
education_mapping = {
'Graduation': 'Graduation',
'PhD': 'PhD',
'Master': 'Master',
'Basic': 'Basic',
'2n Cycle': 'Master'
}
df['Education'] = df['Education'].map(education_mapping)
plt.figure(figsize=(12, 3.5))
categories = ['Dependents', 'Marital_Status', 'Education']
palette = 'magma'
for i, col in enumerate(categories, 1):
plt.subplot(1, 3, i)
order = df[col].value_counts().index
sns.countplot(data=df, x=col, palette=palette, order=order, hue=col, legend=False)
plt.title(f'Count of {col}')
plt.ylabel('')
plt.tight_layout();
2.1.8. Summary of Newly-Created Categorical Data¶
- `Dependents`
  - The majority of customers have 1 dependent, with over 1000 individuals falling into this category.
  - Customers with 0 dependents are the next most common, totaling around 600.
  - The count for customers with 2 dependents ranks third, comprising approximately 400 individuals.
  - The count for customers with 3 dependents is the lowest, and there are no instances of customers having more than 3 dependents.
- `Marital_Status`
  - The dataset shows a higher number of customers in a relationship compared to single customers, with nearly 1500 and 750 individuals, respectively.
- `Education`
  - The distribution shape of `Education` is similar to `Dependents`, with the highest number of customers falling into the "Graduation" category.
  - The "Master" and "PhD" categories have the second- and third-highest counts, with around 600 and 500 customers, respectively.
  - The "Basic" education category has the lowest count among the education levels.
2.2. Visualize and transform numerical features¶
numeric_cols = [
'Income', 'Recency',
'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
'NumDealsPurchases',
'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
'NumWebVisitsMonth'
]
fig, axes = plt.subplots(5, 3, figsize=(15, 20))
for i, col in enumerate(numeric_cols):
ax = axes[i // 3, i % 3]
sns.histplot(data=df, x=col, ax=axes[i // 3, i % 3])
ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title(f'Histogram of {col}')
plt.tight_layout();
2.2.1. Summary of the Histograms¶
- The histograms for most features, except `Recency` and `NumWebVisitsMonth`, exhibit right-skewness with a concentration of outliers at the high end.
- The `Income` histogram contains outliers, hindering a clear view of the underlying distribution. Further examination of these outliers is necessary.
- Considering the similarities in certain features, combining them to create new variables such as total cost could enhance the analysis.
- The uniform distribution of `Recency` suggests limited discriminatory power for customer segmentation.
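As an aside to the right-skewness noted above, a log transform is one common remedy (not applied in this notebook); a sketch on made-up spending values:

```python
import numpy as np
import pandas as pd

# log1p compresses the long right tail while keeping zero spending at zero.
spend = pd.Series([0, 5, 20, 88, 546, 1622])  # hypothetical Mnt-style values
skew_before = spend.skew()
skew_after = np.log1p(spend).skew()
print(skew_before, skew_after)  # skew shrinks markedly after the transform
```

Whether such a transform helps here depends on the clustering algorithm; for K-means on standardized data it mainly reduces the pull of extreme spenders.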
2.2.2. Examine outliers of Income¶
median_income = df['Income'].median()
df['Income'] = df['Income'].fillna(median_income)
df[df['Income'] > 100000]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | Year_Enrolled | Accepted_Campaign | Dependents | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 124 | 7215 | 1983 | Graduation | Single | 101970.00 | 0 | 0 | 12-03-2013 | 69 | 722 | 27 | 102 | 44 | 72 | 168 | 0 | 6 | 8 | 13 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 3 | 11 | 1 | 2013-03-12 | 3 | 0 |
| 164 | 8475 | 1973 | PhD | Relationship | 157243.00 | 0 | 1 | 01-03-2014 | 98 | 20 | 2 | 1582 | 1 | 2 | 1 | 15 | 0 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2014-03-01 | 0 | 1 |
| 203 | 2798 | 1977 | PhD | Relationship | 102160.00 | 0 | 0 | 02-11-2012 | 54 | 763 | 29 | 138 | 76 | 176 | 58 | 0 | 7 | 9 | 10 | 4 | 0 | 1 | 1 | 1 | 0 | 0 | 3 | 11 | 1 | 2012-11-02 | 3 | 0 |
| 252 | 10089 | 1974 | Graduation | Single | 102692.00 | 0 | 0 | 05-04-2013 | 5 | 168 | 148 | 444 | 32 | 172 | 148 | 1 | 6 | 9 | 13 | 2 | 0 | 1 | 1 | 1 | 1 | 0 | 3 | 11 | 1 | 2013-04-05 | 4 | 0 |
| 617 | 1503 | 1976 | PhD | Relationship | 162397.00 | 1 | 1 | 03-06-2013 | 31 | 85 | 1 | 16 | 2 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2013-06-03 | 0 | 2 |
| 646 | 4611 | 1970 | Graduation | Relationship | 105471.00 | 0 | 0 | 21-01-2013 | 36 | 1009 | 181 | 104 | 202 | 21 | 207 | 0 | 9 | 8 | 13 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 11 | 1 | 2013-01-21 | 2 | 0 |
| 655 | 5555 | 1975 | Graduation | Single | 153924.00 | 0 | 0 | 07-02-2014 | 81 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2014-02-07 | 0 | 0 |
| 687 | 1501 | 1982 | PhD | Relationship | 160803.00 | 0 | 0 | 04-08-2012 | 21 | 55 | 16 | 1622 | 17 | 3 | 4 | 15 | 0 | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2012-08-04 | 0 | 0 |
| 1300 | 5336 | 1971 | Master | Relationship | 157733.00 | 1 | 0 | 04-06-2013 | 37 | 39 | 1 | 9 | 2 | 0 | 8 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2013-06-04 | 0 | 1 |
| 1653 | 4931 | 1977 | Graduation | Relationship | 157146.00 | 0 | 0 | 29-04-2013 | 13 | 1 | 0 | 1725 | 2 | 1 | 1 | 0 | 0 | 28 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2013-04-29 | 0 | 0 |
| 1898 | 4619 | 1945 | PhD | Single | 113734.00 | 0 | 0 | 28-05-2014 | 9 | 6 | 2 | 3 | 1 | 262 | 3 | 0 | 27 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2014-05-28 | 0 | 0 |
| 2132 | 11181 | 1949 | PhD | Relationship | 156924.00 | 0 | 0 | 29-08-2013 | 85 | 2 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2013-08-29 | 0 | 0 |
| 2233 | 9432 | 1977 | Graduation | Relationship | 666666.00 | 1 | 0 | 02-06-2013 | 23 | 9 | 14 | 18 | 8 | 1 | 12 | 4 | 3 | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2013-06-02 | 0 | 1 |
Note:
There are 13 individuals with an income exceeding 100,000, all of whom have an education level of 'Graduation' or beyond. The highest reported income among them is 666,666.
Now let's re-plot:
plt.figure(figsize=(5, 3.5))
sns.histplot(data=df[~(df['Income'] > 100000)], x='Income')
plt.axvline(df['Income'].mean(), color='red', linestyle='--', label=f"Mean of {df['Income'].mean():.0f}")
plt.title('Income histogram when removing outliers')
plt.legend();
Note:
With the outliers removed, the distribution of Income is approximately normal, centered around 52k.
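The 100,000 cutoff above was chosen by inspection; the IQR rule is a common data-driven alternative. A sketch on a made-up income sample (the 1.5 multiplier is the conventional choice):

```python
import pandas as pd

# Flag values beyond Q3 + 1.5*IQR as outliers; the sample here is fabricated.
income = pd.Series([20_000, 35_000, 52_000, 58_000, 71_000, 666_666])
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
outliers = income[income > upper]
print(upper, outliers.tolist())  # 110500.0 [666666]
```

Applied to the real `df['Income']`, this would give a principled threshold instead of the round number used above.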
2.2.3. Create Day_Enrolled¶
threshold = df['Year_Enrolled'].max()
df['Day_Enrolled'] = (threshold - df['Year_Enrolled']).dt.days
2.2.4. Create Age¶
threshold = df['Year_Enrolled'].dt.year.max()
df['Age'] = threshold - df['Year_Birth']
plt.figure(figsize=(10, 3.5))
plt.subplot(1, 2, 1)
sns.histplot(data=df, x='Age')
plt.title('Age histogram');
plt.subplot(1, 2, 2)
sns.histplot(data=df, x='Day_Enrolled')
plt.title('Day_Enrolled histogram')
plt.ylabel('');
- The `Age` histogram reveals a relatively normal distribution, except for some outliers at the high end, particularly individuals aged over 100 years. Further investigation is required to understand and address these outliers.
- In contrast, the `Day_Enrolled` histogram depicts a uniform distribution, indicating minimal utility for the clustering model. This feature may not significantly contribute to the segmentation process.
2.2.5. Check Age's outliers¶
df[df['Age'] > 80]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | Year_Enrolled | Accepted_Campaign | Dependents | Day_Enrolled | Age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 192 | 7829 | 1900 | Master | Single | 36640.00 | 1 | 0 | 26-09-2013 | 99 | 15 | 6 | 8 | 7 | 4 | 25 | 1 | 2 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 11 | 0 | 2013-09-26 | 0 | 1 | 276 | 114 |
| 239 | 11004 | 1893 | Master | Single | 60182.00 | 0 | 1 | 17-05-2014 | 23 | 8 | 0 | 5 | 7 | 0 | 2 | 1 | 1 | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2014-05-17 | 0 | 1 | 43 | 121 |
| 339 | 1150 | 1899 | PhD | Relationship | 83532.00 | 0 | 0 | 26-09-2013 | 36 | 755 | 144 | 562 | 104 | 64 | 224 | 1 | 4 | 6 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 11 | 0 | 2013-09-26 | 1 | 0 | 276 | 115 |
Note:
Only 3 people are older than 80 and, more interestingly, all of them are above 110, with the oldest being 121. I will remove them from our data.
plt.figure(figsize=(5, 3.5))
sns.histplot(data=df[~(df['Age'] > 80)], x='Age')
plt.axvline(df['Age'].mean(), color='red', linestyle='--', label= f"Mean of {df['Age'].mean():.0f}")
plt.title('Age histogram when removing outliers')
plt.legend();
Note:
The Age distribution exhibits a normal pattern after the removal of outliers.
2.2.6. Create Total_Cost¶
df['Total_Cost'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
plt.figure(figsize=(5, 3.5))
sns.histplot(data=df, x='Total_Cost', bins=range(0,2501,100))
plt.title('Total_Cost histogram');
2.3. Drop redundant variables¶
corr = df.corr(numeric_only=True)
mask = np.triu(np.ones(corr.shape), k=1).astype(np.bool_)
corr = corr.where(mask)
corr = corr.unstack().sort_values(ascending=False).to_frame(name='Correlation')
corr = corr[corr['Correlation'] != 1].reset_index()
corr = corr.dropna()
corr.head(20)
| level_0 | level_1 | Correlation | |
|---|---|---|---|
| 0 | Total_Cost | MntWines | 0.89 |
| 1 | Total_Cost | MntMeatProducts | 0.84 |
| 2 | Total_Cost | NumCatalogPurchases | 0.78 |
| 3 | NumCatalogPurchases | MntMeatProducts | 0.72 |
| 4 | Accepted_Campaign | AcceptedCmp5 | 0.72 |
| 5 | Dependents | Teenhome | 0.70 |
| 6 | Dependents | Kidhome | 0.69 |
| 7 | Accepted_Campaign | AcceptedCmp1 | 0.68 |
| 8 | Total_Cost | NumStorePurchases | 0.67 |
| 9 | Total_Cost | Income | 0.66 |
| 10 | Total_Cost | MntFishProducts | 0.64 |
| 11 | NumStorePurchases | MntWines | 0.64 |
| 12 | NumCatalogPurchases | MntWines | 0.64 |
| 13 | Accepted_Campaign | AcceptedCmp4 | 0.61 |
| 14 | Total_Cost | MntFruits | 0.61 |
| 15 | Total_Cost | MntSweetProducts | 0.60 |
| 16 | MntFishProducts | MntFruits | 0.59 |
| 17 | NumCatalogPurchases | Income | 0.59 |
| 18 | MntSweetProducts | MntFishProducts | 0.58 |
| 19 | MntMeatProducts | Income | 0.58 |
Note:
It's evident that the newly-created variables are inherently correlated with the variables from which they are derived. Therefore, I've decided to retain only the total variables in order to mitigate dimensionality and simplify the model's complexity.
cleaned_df = df.copy().drop(
columns=
[
'Year_Birth',
'ID',
'Z_CostContact', 'Z_Revenue',
'Dt_Customer',
'Year_Enrolled',
'Complain',
'Recency',
'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', # Total Cost
'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', # Accepted_Campaign
'Kidhome', 'Teenhome' # Total Dependents
]
)
print(cleaned_df.shape)
cleaned_df.head()
(2240, 14)
| Education | Marital_Status | Income | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Response | Accepted_Campaign | Dependents | Day_Enrolled | Age | Total_Cost | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graduation | Single | 58138.00 | 3 | 8 | 10 | 4 | 7 | 1 | 0 | 0 | 663 | 57 | 1617 |
| 1 | Graduation | Single | 46344.00 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 2 | 113 | 60 | 27 |
| 2 | Graduation | Relationship | 71613.00 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 312 | 49 | 776 |
| 3 | Graduation | Relationship | 26646.00 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 1 | 139 | 30 | 53 |
| 4 | PhD | Relationship | 58293.00 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 1 | 161 | 33 | 422 |
2.4. Correlation Analysis¶
corr = cleaned_df.corr(numeric_only=True)
plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, cmap=sns.diverging_palette(145, 300, s=60, as_cmap=True),
linecolor='w', linewidth=0.1, fmt='.2f', mask=np.triu(corr), annot_kws={'size':9});
Some highlights:
- `Income` and `Total_Cost`: The correlation between `Income` and `Total_Cost` is notably strong (0.66). This suggests that customers with higher incomes tend to have higher total costs.
- `NumWebVisitsMonth` and `Total_Purchases`: There is a relatively strong negative correlation between `NumWebVisitsMonth` and `Total_Purchases` (-0.43). This implies that as the number of web visits per month decreases, the total number of purchases tends to increase.
- `Accepted_Campaign` and `Response`: These variables have a positive correlation of 0.43. This suggests that customers who responded positively to the last campaign are more likely to have accepted earlier campaigns.
- `Total_Cost` and `Total_Purchases`: These are highly positively correlated (0.82). This indicates that higher total costs correspond to a higher number of total purchases.
- `Age` and `Total_Purchases`: These have a positive correlation of 0.16. This indicates that, in general, older customers tend to have more total purchases.
- `Dependents` and `NumDealsPurchases`: These have a relatively strong positive correlation (0.44). This suggests that customers with more dependents tend to make more deal purchases.
education_dummies = pd.get_dummies(cleaned_df['Education'], prefix='Education', dtype=int)
marital_dummies = pd.get_dummies(cleaned_df['Marital_Status'], prefix='Marital_Status', dtype=int)
education_dummies.shape, marital_dummies.shape
((2240, 4), (2240, 2))
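A side note on the encoding: with mutually exclusive categories, one dummy column is always implied by the others, and `pd.get_dummies(..., drop_first=True)` can drop it. Whether that helps a distance-based model like K-means is a judgment call; the sketch below only shows the option:

```python
import pandas as pd

# With k exclusive categories, one dummy is redundant (it equals
# 1 minus the sum of the others); drop_first=True removes it.
s = pd.Series(['Single', 'Relationship', 'Single'])
full = pd.get_dummies(s, prefix='Marital_Status', dtype=int)
reduced = pd.get_dummies(s, prefix='Marital_Status', drop_first=True, dtype=int)
print(full.shape[1], reduced.shape[1])  # 2 1
```

This notebook keeps the full dummies, which preserves symmetric distances between categories at the cost of one redundant dimension per categorical variable.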
3.2. Scale the Data¶
- I will scale only the numerical data and then concatenate it back with the encoded `education_dummies` and `marital_dummies`.
numerical_names = corr.columns.tolist()
numerical_data = cleaned_df[numerical_names].copy()
numerical_data.shape
(2240, 12)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numerical_data)
scaled_data = pd.DataFrame(scaled_data, columns=numerical_names)
scaled_data = pd.concat([education_dummies, marital_dummies, scaled_data], axis=1)
scaled_data.head()
| Education_Basic | Education_Graduation | Education_Master | Education_PhD | Marital_Status_Relationship | Marital_Status_Single | Income | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Response | Accepted_Campaign | Dependents | Day_Enrolled | Age | Total_Cost | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.24 | 0.35 | 1.41 | 2.51 | -0.55 | 0.69 | 2.39 | -0.44 | -1.26 | 1.53 | 0.99 | 1.68 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | -0.24 | -0.17 | -1.11 | -0.57 | -1.17 | -0.13 | -0.42 | -0.44 | 1.40 | -1.19 | 1.24 | -0.96 |
| 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0.77 | -0.69 | 1.41 | -0.23 | 1.30 | -0.54 | -0.42 | -0.44 | -1.26 | -0.21 | 0.32 | 0.28 |
| 3 | 0 | 1 | 0 | 0 | 1 | 0 | -1.02 | -0.17 | -0.75 | -0.91 | -0.55 | 0.28 | -0.42 | -0.44 | 0.07 | -1.06 | -1.27 | -0.92 |
| 4 | 0 | 0 | 0 | 1 | 1 | 0 | 0.24 | 1.38 | 0.33 | 0.12 | 0.06 | -0.13 | -0.42 | -0.44 | 0.07 | -0.95 | -1.02 | -0.31 |
scaled_data.shape
(2240, 18)
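One pitfall to keep in mind with the `pd.concat` step above: `StandardScaler` returns a plain array, so the rebuilt DataFrame gets a fresh 0..n-1 index. That is harmless here, but if rows had been dropped earlier, concatenation would misalign unless the source index is passed along. A sketch:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A frame with a gappy index, as if some rows had been dropped.
df_gappy = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=[0, 2, 5])
scaled = StandardScaler().fit_transform(df_gappy)       # plain ndarray, no index
safe = pd.DataFrame(scaled, columns=df_gappy.columns,
                    index=df_gappy.index)               # index preserved
assert safe.index.tolist() == [0, 2, 5]
```

Passing `index=...` when rebuilding the DataFrame makes the later `pd.concat(axis=1)` align by row rather than silently introducing NaNs.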
4. Dimensionality Reduction¶
Dimensionality reduction is a crucial technique in data analysis and machine learning. It serves various purposes, including:
Curse of Dimensionality: With an increase in the number of features, data volume grows exponentially. This can lead to storage, computational, and analysis challenges. High dimensions often result in sparse data, making it harder to extract meaningful insights.
Improved Model Performance: Some machine learning algorithms struggle with high-dimensional data, leading to overfitting. Dimensionality reduction can alleviate this issue by reducing noise and improving model generalization.
Visualization: Visualizing data in high dimensions is challenging. Dimensionality reduction techniques project data into lower-dimensional spaces (e.g., 2D or 3D), aiding visualization and understanding.
Noise Reduction: High-dimensional data often contains noise or irrelevant features. Dimensionality reduction can filter out noise and retain essential information.
In our data (scaled_data), we currently have 18 features, representing an 18-dimensional space. Managing and interpreting data in such a high-dimensional space can be complex and resource-intensive. To address this, Principal Component Analysis (PCA) will be applied to reduce dimensionality while retaining the most critical information.
4.1. Initialize PCA object¶
Let's start with n_components = 3. As in my other projects, this serves as a baseline: an initial, somewhat arbitrary choice to be refined later.
%%time
pca = PCA(n_components=3)
pca.fit(scaled_data)
CPU times: total: 0 ns Wall time: 0 ns
PCA(n_components=3)
pca_data = pca.transform(scaled_data)
pca_data
array([[ 2.56736055, 1.96590984, 1.60191808],
[-1.93787147, -0.89604036, -1.03074577],
[ 1.6058172 , -0.37468525, -0.93492673],
...,
[ 1.60791258, -1.19434595, 0.44511462],
[ 1.571549 , -0.34181938, -1.86821953],
[-1.09282977, 1.7029187 , 0.97572601]])
Let's plot the result on the scree plot:
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var) + 1)]
plt.figure(figsize=(6, 4))
sns.barplot(x=labels, y=per_var, palette='magma', hue=labels, legend=False)
plt.ylabel('Percentage of Explained Variance (%)')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.tight_layout()
sns.despine()
for i, v in enumerate(per_var):
plt.text(i, v + 1, str(v) + '%', color='black', ha='center', va='bottom')
Note: With the specified number of components (n=3), the variance explained by each of the first three principal components is as follows:
- PC1: Explains approximately 31.8% of the total variance.
- PC2: Explains approximately 13.7% of the total variance.
- PC3: Explains approximately 10.7% of the total variance.
These percentages indicate the proportion of the original variance retained in each principal component. Since we're using the first three principal components, the PCA has captured a total of around 56.2% of the total variance in the data.
This reduction in dimensionality streamlines our data representation while preserving a substantial portion of its variability.
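The 56.2% figure is simply the sum of the per-component ratios, which can be read straight off the fitted PCA object. A minimal sketch, using a synthetic array X as a stand-in for scaled_data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for scaled_data: 18 correlated features, 500 rows
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 18)) + rng.normal(size=(500, 18))

pca = PCA(n_components=3).fit(X)

# Per-component percentages, as shown on the scree plot
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)
# Total variance captured by the three components together
total_captured = pca.explained_variance_ratio_.sum()
print(per_var, f"total: {total_captured:.1%}")
```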
pca_df = pd.DataFrame(pca_data, columns=labels)
pca_df
pc1_values = pca_df['PC1']
pc2_values = pca_df['PC2']
pc3_values = pca_df['PC3']
fig = px.scatter_3d(pca_df, x=pc1_values, y=pc2_values, z=pc3_values, title='3D Scatter Plot of PC1, PC2, and PC3')
fig.update_layout(scene=dict(xaxis_title='PC1 - {0}%'.format(per_var[0]),
yaxis_title='PC2 - {0}%'.format(per_var[1]),
zaxis_title='PC3 - {0}%'.format(per_var[2])))
fig.show();
4.2. Identify the Optimal Number of Components¶
When deciding the number of principal components to retain, we will encounter a trade-off between reducing dimensionality and preserving information. This trade-off is influenced by the threshold we set for the variance explained.
Higher Threshold (e.g., 95% Variance Explained): Choosing a higher threshold, like 95% of the variance explained, emphasizes retaining as much information as possible. This means keeping more principal components and thus a higher-dimensional representation of the data: most of the original variability is preserved, at the cost of less dimensionality reduction.
Lower Threshold (e.g., 50% Variance Explained): Opting for a lower threshold, such as 50% variance explained, prioritizes dimensionality reduction. In this case, some information is sacrificed to achieve a substantial reduction in dimensionality. Although this approach results in a lower-dimensional data representation, it might not capture all subtle variations in the original data.
Trade-off: If preserving the majority of the original variability is crucial and computational resources allow, a higher threshold may be appropriate. On the other hand, if reducing dimensionality is a priority and a modest loss of information is acceptable, a lower threshold can be chosen.
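Scikit-learn can apply such a threshold directly: a float between 0 and 1 passed as n_components tells PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A sketch of both ends of the trade-off, using a synthetic array X as a stand-in for scaled_data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for scaled_data: variance concentrated in a few directions
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 4))
X = latent @ rng.normal(size=(4, 18)) + 0.3 * rng.normal(size=(500, 18))

# A float n_components acts as a variance threshold, not a component count
pca_hi = PCA(n_components=0.95, svd_solver='full').fit(X)
pca_lo = PCA(n_components=0.50, svd_solver='full').fit(X)

print("Components for 95% variance:", pca_hi.n_components_)
print("Components for 50% variance:", pca_lo.n_components_)
```

A higher threshold never selects fewer components than a lower one, so this directly illustrates the trade-off described above.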
To determine the optimal number of components, we will apply two criteria:¶
Total Variance Criterion:¶
- Explanation: This criterion focuses on retaining a sufficient number of principal components to explain a specified percentage of the total variance in the data. The idea is to capture the majority of the variability while reducing dimensionality.
- I will retain a number of components that collectively explain 50-70% of the total variance, aiming as close to 70% as possible to limit information loss.
Kaiser Criterion:¶
- Explanation: The Kaiser criterion is based on the eigenvalues of the principal components. It suggests retaining components with eigenvalues greater than 1, as these components explain more variance than would be expected by chance.
- I will also apply the Kaiser criterion when choosing the number of components.
4.2.1. Criterion #1 - Total Variance Criterion: 50-70% Cumulative Explained Variance¶
def find_components_near_threshold(data, from_threshold, to_threshold=None):
    """
    Find the number of principal components that explain the specified variance range/value.

    Parameters:
        data (array-like): The scaled data.
        from_threshold (float): The lower threshold for variance explained.
        to_threshold (float): The upper threshold for variance explained (optional).

    Returns:
        str: A string indicating the number of components that explain the specified variance range/value.
    """
    pca = PCA(n_components='mle', random_state=42)
    pca.fit(data)
    cumulative = np.cumsum(pca.explained_variance_ratio_)

    # Smallest n whose cumulative variance reaches the lower threshold
    n_from = next((n for n, c in enumerate(cumulative, start=1) if c >= from_threshold), None)
    if n_from is None:
        return f"No number of components explains at least {from_threshold*100}% variance"

    if to_threshold is None:
        return f"Number of components explaining at least {from_threshold*100}% variance is {n_from}"

    # Smallest n whose cumulative variance reaches the upper threshold
    # (fall back to the last component if it is never reached); the range is
    # inclusive so the component count at the boundary is reported too
    n_to = next((n for n, c in enumerate(cumulative, start=1) if c >= to_threshold), len(cumulative))
    return (f"Numbers of components explaining {from_threshold*100}% to {to_threshold*100}% "
            f"variance are {', '.join(map(str, range(n_from, n_to + 1)))}")
result1 = find_components_near_threshold(scaled_data, 0.5)
result2 = find_components_near_threshold(scaled_data, 0.5, 0.7)
print(result1)
print()
print(result2)
Number of components explaining at least 50.0% variance is 3

Numbers of components explaining 50.0% to 70.0% variance are 3, 4, 5
I will create a table showing Explained Variance by using pca.explained_variance_ratio_ so we can observe more easily:
# Fit once with all components instead of refitting PCA 18 times
pca = PCA(n_components=len(scaled_data.columns))
pca.fit(scaled_data)
variance_df = pd.DataFrame({
    'Explained Variance': pca.explained_variance_ratio_.round(2),
    'Cumulative Explained Variance': np.cumsum(pca.explained_variance_ratio_).round(2),
}, index=range(1, len(scaled_data.columns) + 1))
variance_df.index.name = 'n_components'
variance_df.T
| n_components | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Explained Variance | 0.32 | 0.14 | 0.11 | 0.08 | 0.06 | 0.05 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 | 0.01 | 0.00 | 0.00 | 0.00 |
| Cumulative Explained Variance | 0.32 | 0.46 | 0.56 | 0.64 | 0.70 | 0.75 | 0.80 | 0.83 | 0.87 | 0.90 | 0.92 | 0.95 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 |
Let's plot:
plt.figure(figsize=(6, 4))
sns.lineplot(data=variance_df, x=variance_df.index, y='Cumulative Explained Variance', marker='o', label='Cumulative Explained Variance')
plt.axhline(y=0.5, color='c', linestyle='--', label='0.5 Threshold')
plt.axhline(y=0.7, color='c', linestyle='--', label='0.7 Threshold')
plt.title('Explained Variance vs. Number of PCA Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.legend()
plt.tight_layout();
Note:
As the function's output and the table above show, n_components = 3, 4, or 5 keeps the cumulative explained variance within the 50-70% range.
Now let's move to the next criterion.
4.2.2. Criterion #2 - Kaiser Criterion: Eigenvalues > 1¶
variance_df['Eigenvalue'] = pca.explained_variance_
variance_df.T
| n_components | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Explained Variance | 0.32 | 0.14 | 0.11 | 0.08 | 0.06 | 0.05 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 | 0.01 | 0.00 | 0.00 | 0.00 |
| Cumulative Explained Variance | 0.32 | 0.46 | 0.56 | 0.64 | 0.70 | 0.75 | 0.80 | 0.83 | 0.87 | 0.90 | 0.92 | 0.95 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 |
| Eigenvalue | 4.17 | 1.80 | 1.40 | 1.02 | 0.83 | 0.64 | 0.56 | 0.48 | 0.43 | 0.41 | 0.36 | 0.32 | 0.27 | 0.22 | 0.15 | 0.03 | 0.00 | 0.00 |
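The two tables are linked by a simple identity: sklearn's explained_variance_ values are the eigenvalues of the covariance matrix, and each explained-variance ratio is that eigenvalue divided by the eigenvalue total (which, for standardized features, equals the number of features up to the n/(n-1) sample-size correction). A sketch on synthetic standardized data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
raw = rng.normal(size=(400, 18)) @ rng.normal(size=(18, 18))  # correlated features
X = StandardScaler().fit_transform(raw)                        # mimic scaled_data

pca = PCA().fit(X)  # keep all 18 components
eigenvalues = pca.explained_variance_

# Each ratio is its eigenvalue divided by the eigenvalue total
ratios = eigenvalues / eigenvalues.sum()
# For standardized features the eigenvalues sum to n_features * n/(n-1)
print(round(eigenvalues.sum(), 2))
```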
plt.figure(figsize=(8, 4))
sns.lineplot(data=variance_df, x=variance_df.index, y='Eigenvalue', marker='o')
plt.xticks(variance_df.index)
plt.axhline(y=1, color='r', linestyle='--', label='Kaiser Criterion: Eigenvalue > 1')
plt.title('Scree Plot with Kaiser Criterion')
plt.legend();
def find_components_eigenvalue_criterion(data):
    """
    Find the number of principal components based on the Kaiser Criterion (Eigenvalues > 1).

    Parameters:
        data (array-like): The scaled data.

    Returns:
        str: A string indicating the number of components based on the Kaiser Criterion.
    """
    cov_matrix = np.cov(data.astype(float), rowvar=False)
    # eigvalsh is the right routine for a symmetric matrix: it is faster and
    # guarantees real eigenvalues (eigvals can return values with tiny
    # spurious imaginary parts due to rounding)
    eigenvalues = np.linalg.eigvalsh(cov_matrix)
    n_components = np.sum(eigenvalues > 1)
    return f"Number of components based on Kaiser Criterion (Eigenvalues > 1): {n_components}."
find_components_eigenvalue_criterion(scaled_data)
'Number of components based on Kaiser Criterion (Eigenvalues > 1): 4.'
4.3. Conclusion about n_components¶
Criterion 1: Explained Variance (50-70%)
When using the criterion of explained variance, we aim to find the number of principal components that collectively capture a substantial portion of the variability in the data. In this case, we are interested in components that explain between 50% and 70% of the total variance in the dataset. Using the function find_components_near_threshold, we determine that for this range, the optimal number of components is 3, 4, and 5. These components strike a balance between dimensionality reduction and information retention.
Criterion 2: Kaiser Criterion (Eigenvalues > 1)
The Kaiser Criterion involves examining the eigenvalues associated with the principal components. Eigenvalues indicate the amount of variance explained by each principal component. A common guideline is to retain only those components with eigenvalues greater than 1. When applying this criterion using the function find_components_eigenvalue_criterion, we identify that the first 4 components have eigenvalues exceeding 1. These components contribute significantly to the variability in the data.
Choosing n_components = 4 not only aligns with the Kaiser Criterion by retaining components with eigenvalues greater than 1, but also satisfies the Explained Variance Criterion by capturing a substantial portion of the variability in the data.
pca = PCA(n_components=4)
pca.fit(scaled_data)
pca_final = pd.DataFrame(
    pca.transform(scaled_data),
    columns=[f"PC{i}" for i in range(1, 5)]
)
pca_final.head()
| PC1 | PC2 | PC3 | PC4 | |
|---|---|---|---|---|
| 0 | 2.57 | 1.97 | 1.60 | 0.08 |
| 1 | -1.94 | -0.90 | -1.03 | 1.71 |
| 2 | 1.61 | -0.37 | -0.93 | -0.88 |
| 3 | -1.99 | -1.12 | 0.24 | -0.58 |
| 4 | -0.25 | 0.30 | -0.74 | -0.41 |
Note: By reducing the dimensionality from 18 features to 4 principal components, we've effectively simplified the dataset while retaining the most critical information. This streamlined dataset is now more suitable for further analysis and modeling while avoiding the complexities associated with high-dimensional data.
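One way to quantify what the 4-component representation discards is to map the compressed data back to the original feature space with inverse_transform and measure the round-trip error; for the training data, the fraction of variance lost equals one minus the cumulative explained variance ratio. A sketch with a synthetic array X standing in for scaled_data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for scaled_data: 4 strong directions plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X = latent @ rng.normal(size=(4, 18)) + 0.2 * rng.normal(size=(300, 18))

pca = PCA(n_components=4).fit(X)
X_reduced = pca.transform(X)               # (300, 4) compressed representation
X_back = pca.inverse_transform(X_reduced)  # projected back to (300, 18)

# Fraction of total (centered) variance lost in the round trip
lost = ((X - X_back) ** 2).sum() / ((X - X.mean(axis=0)) ** 2).sum()
print(f"Variance lost with 4 components: {lost:.1%}")
```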
5. Customer Segmentation using K-Means Clustering¶
In the next step of the analysis, I'll focus on customer segmentation using the K-Means clustering algorithm. To determine the optimal number of clusters (k), I will employ the Elbow Method and the Silhouette Score.
Elbow Method:
- Visual approach to determine optimal cluster count
- Run K-Means for different k values
- Plot WCSS against the number of clusters
- Identify "elbow point" where WCSS reduction slows
- Indicates balance between minimizing variance and overfitting
Silhouette Score:
- Measures object similarity to its cluster and others
- Score range: -1 to 1
- Higher score means object matches its cluster well
- Helps evaluate cluster quality and select best k value
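Before applying these criteria to the project data, they can be sanity-checked on a toy dataset with known structure: make_blobs generates three well-separated groups, so the silhouette score should peak at k = 3:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with a known number of clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("Silhouette by k:", {k: round(v, 2) for k, v in scores.items()})
print("Best k:", best_k)
```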
5.1. Initialize model¶
ranges=range(1,11)
inertia = []
sil_scores = []
for n in ranges:
kmeans = KMeans(n_clusters=n,
n_init=10,
algorithm ='lloyd',
random_state=42)
kmeans.fit(pca_final)
inertia.append(kmeans.inertia_)
if n > 1:
sil_score = silhouette_score(pca_final, kmeans.labels_)
sil_scores.append(sil_score)
else:
sil_scores.append(0)
5.2. Choose k¶
plt.figure(figsize=(9, 4))
plt.subplot(1, 2, 1)
plt.plot(ranges, inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k (Inertia)')
plt.subplot(1, 2, 2)
plt.plot(ranges, sil_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Optimal k')
plt.tight_layout();
Another way to plot:
kmeans = KMeans(n_init=10, random_state=42)
elbow_visualizer = KElbowVisualizer(kmeans, k=(2,10))
elbow_visualizer.fit(pca_final)
elbow_visualizer.show()
elbow_visualizer = KElbowVisualizer(kmeans, k=(2,10), metric='silhouette')
elbow_visualizer.fit(pca_final)
elbow_visualizer.show();
5.3. Determining the Optimal Number of Clusters: Elbow vs. Silhouette¶
In our quest to determine the optimal number of clusters for customer segmentation, we employed two distinct methods:
Elbow Method: This method revealed a clear inflection point, or "elbow," at 4 clusters.
Silhouette Score: The Silhouette Score, which measures the quality of clustering, was highest for 2 clusters but also exhibited a substantial peak at 4 clusters.
Decision: The two methods converge: the Silhouette Score's secondary peak at 4 clusters coincides with the elbow point at 4 clusters in the Elbow Method. I therefore chose 4 clusters for the customer segmentation, balancing the reduction in inertia against clustering quality.
kmeans = KMeans(n_clusters=4,
n_init=10,
random_state=42)
kmeans.fit(pca_final)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
pca_final['Cluster'] = labels
df['Cluster'] = labels
fig = plt.figure(figsize=(10,7))
ax = plt.subplot(projection='3d')
ax.scatter(xs = pca_final["PC1"],
ys = pca_final["PC2"],
zs = pca_final["PC3"],
s=50,
c= pca_final["Cluster"],
marker='o',
alpha = 0.5,
cmap = 'magma')
plt.title("Static 3D Clusters");
fig = px.scatter_3d(pca_final, x='PC1', y='PC2', z='PC3', color='Cluster', opacity=0.1, color_continuous_scale='magma')
fig.update_layout(title='Interactive 3D Clusters with Centroids')
fig.add_scatter3d(x=centroids[:, 0], y=centroids[:, 1], z=centroids[:, 2],
mode='markers',
marker=dict(size=7, color='red', symbol='x'),
showlegend=False)
6. Evaluate the Clusters¶
df.head(2)
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | Year_Enrolled | Accepted_Campaign | Dependents | Day_Enrolled | Age | Total_Cost | Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.00 | 0 | 0 | 04-09-2012 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 2012-09-04 | 0 | 0 | 663 | 57 | 1617 | 2 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.00 | 1 | 1 | 08-03-2014 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 2014-03-08 | 0 | 2 | 113 | 60 | 27 | 1 |
sns.countplot(x=df['Cluster'], palette='magma', hue=df['Cluster'], legend=False);
df['Cluster'].value_counts()
Cluster
1    988
3    596
0    440
2    216
Name: count, dtype: int64
pd.pivot_table(data=df, index='Cluster', values = corr.columns.to_list(), margins=True).T
| Cluster | 0 | 1 | 2 | 3 | All |
|---|---|---|---|---|---|
| Accepted_Campaign | 0.22 | 0.08 | 1.61 | 0.24 | 0.30 |
| Age | 48.45 | 42.52 | 44.46 | 47.49 | 45.19 |
| Day_Enrolled | 488.57 | 308.32 | 407.59 | 309.37 | 353.58 |
| Dependents | 1.37 | 1.22 | 0.19 | 0.47 | 0.95 |
| Income | 52398.10 | 34484.70 | 77157.74 | 72518.36 | 52237.98 |
| NumCatalogPurchases | 2.61 | 0.52 | 6.11 | 5.00 | 2.66 |
| NumDealsPurchases | 4.91 | 1.86 | 1.23 | 1.58 | 2.33 |
| NumStorePurchases | 6.83 | 3.19 | 7.70 | 8.65 | 5.79 |
| NumWebPurchases | 6.45 | 2.02 | 5.75 | 5.16 | 4.08 |
| NumWebVisitsMonth | 6.82 | 6.37 | 3.63 | 3.07 | 5.32 |
| Response | 0.17 | 0.07 | 0.87 | 0.00 | 0.15 |
| Total_Cost | 653.01 | 91.22 | 1521.67 | 1092.04 | 605.80 |
Education and Marital_Status between Clusters¶
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
sns.countplot(data=df, x= 'Education', hue='Cluster', palette='magma');
plt.subplot(1,2,2)
sns.countplot(data=df, x= 'Marital_Status', hue='Cluster', palette='magma')
plt.tight_layout();
Note:
Education and Marital_Status appear to vary both within and among the customer segments, which means that these variables alone may not be the primary drivers of segment differentiation. In the context of the clustering analysis performed, it appears that other factors such as age, income, purchase behavior, and response to marketing campaigns have played a more significant role in distinguishing the segments.
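This visual impression can be checked formally with a chi-square test of independence on the cluster-by-category contingency table; the column names below mirror the project's, but the labels are synthetic stand-ins drawn independently, so the test should find no association:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-ins for the real df columns, drawn independently of each other
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'Cluster': rng.integers(0, 4, size=1000),
    'Education': rng.choice(['Basic', 'Graduation', 'Master', 'PhD'], size=1000),
})

table = pd.crosstab(toy['Cluster'], toy['Education'])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.3f}")
# A large p-value is consistent with Education not driving the segments
```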
7. Final Conclusion¶
7.1. Customer Segmentation¶
Segment 0: The Loyal Veterans¶
- Segment Size: This segment comprises 440 customers, representing the most long-standing customers of the company.
- Demographics: The customers in this segment are relatively older, with an average age of around 48 years. They have a moderate income, approximately $50k per year, and typically have 1 to 2 dependents.
- Website Visits: They exhibit frequent visits to the company's website, indicating active online engagement.
- Purchase Behavior: These customers are active purchasers both in-store and on the website. They show a preference for discounted items.
- Spending Habits: The segment has a moderate total spending (Total_Cost) of around $650 in the last 2 years.
- Campaign Response: Despite their loyalty, they demonstrate a low response rate to marketing campaigns (0.17), suggesting that traditional campaign strategies may not be highly effective for this group.
Segment 1: The Budget-Conscious Shoppers¶
- Segment Size: This is the largest segment with approximately 988 customers.
- Demographics: The customers in this segment are relatively younger, with an average age of 42. They have a lower income, around $34k, and usually have around 1 dependent.
- Purchase Behavior: Although they frequently visit the website, their purchasing activity is moderate, especially for catalog, deal, and store purchases.
- Website Visits: They show a high frequency of website visits but make fewer purchases compared to other segments, possibly due to budget constraints.
- Spending Habits: This segment has a low total spending (Total_Cost) of $91.2 in the last 2 years.
- Campaign Response: They exhibit a low response rate to marketing campaigns (0.07).
Segment 2: The Affluent Aficionados¶
- Segment Size: This segment consists of 216 customers, representing the smallest among the segments.
- Demographics: Customers in this segment are of roughly average age (about 44), have a remarkably high income averaging $77k, and often have no dependents.
- Purchase Behavior: They make a significant number of purchases through catalog, store, and web channels, and show little interest in discounted deals.
- Website Visits: Despite infrequent website visits (around 3.6 visits per month), they are prolific purchasers.
- Spending Habits: This segment exhibits a high total spending (Total_Cost) of $1521 in the last 2 years.
- Campaign Response: They show a very high response rate to marketing campaigns (0.87), with an average of 1.61 accepted campaigns per customer.
Segment 3: The Silent High Rollers¶
- Segment Size: This segment is composed of 596 customers.
- Demographics: Customers in this segment are relatively older, with high incomes and typically 0 or 1 dependents.
- Purchase Behavior: They are significant purchasers, with frequent catalog, store, and web purchases.
- Website Visits: Despite infrequent website visits (around 3 visits per month), they are active buyers.
- Spending Habits: This segment exhibits a notable total spending (Total_Cost) of $1092, indicating substantial spending on various products in the last 2 years.
- Campaign Response: Surprisingly, they show no response to marketing campaigns (0.00).
7.2. Recommendations¶
Segment 0: The Loyal Veterans¶
Recommendations:
- Loyalty Programs: Enhance and promote loyalty programs tailored to the preferences of this segment to further strengthen customer retention.
- Personalized Campaigns: Implement personalized marketing campaigns based on their historical preferences to improve campaign response rates.
- Engagement Initiatives: Explore initiatives to increase engagement on the website, potentially offering exclusive online promotions.
Segment 1: The Budget-Conscious Shoppers¶
Recommendations:
- Budget-Friendly Promotions: Design marketing campaigns focused on budget-friendly promotions and discounts to resonate with the financial constraints of this segment.
- Educational Content: Provide educational content on how to maximize savings and value, fostering a sense of financial empowerment.
- Customer Feedback: Collect and analyze customer feedback to understand specific pain points and areas for improvement in the purchasing experience.
Segment 2: The Affluent Aficionados¶
Recommendations:
- Exclusive Events: Organize exclusive events or product launches for this segment to capitalize on their high income and purchasing power.
- Premium Experiences: Offer premium and personalized shopping experiences, recognizing their preference for non-discounted items.
- Targeted Campaigns: Create targeted marketing campaigns that align with their unique preferences, emphasizing quality and exclusivity.
Segment 3: The Silent High Rollers¶
Recommendations:
- Reactivation Campaigns: Implement reactivation campaigns to re-engage this segment, as they have shown potential with high incomes and significant past spending.
- Customer Surveys: Conduct surveys to understand the reasons behind their lack of response to marketing campaigns, helping tailor future strategies.
- Strategic Partnerships: Explore partnerships with high-end brands or exclusive offerings to capture their attention and drive engagement.
General Recommendation:¶
- Cross-Segment Analytics: Conduct comprehensive analytics across segments to identify overarching trends and opportunities that may benefit the entire customer base. This can inform global marketing strategies and improve overall customer satisfaction.